Summary of project

Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful small company with the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. She asked the marketing team to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers use their smart devices.
In this case study, I take the role of a junior data analyst working on the Bellabeat marketing team. I will present my analysis to the Bellabeat executive team along with my high-level recommendations for Bellabeat’s marketing strategy.

Business Task

Identify trends in non-Bellabeat smart device usage and apply them to one Bellabeat product. Then use this information to provide high-level recommendations for how these trends can inform Bellabeat’s marketing strategy.

Stakeholders

  • Urška Sršen - Bellabeat cofounder and Chief Creative Officer.
  • Sando Mur - Bellabeat cofounder and key member of Bellabeat executive team.
  • Bellabeat Marketing Analytics team.

Preparing for analysis

Sršen encourages the analytics team to use public data that explores smart device users’ daily habits. She points the team to the FitBit Fitness Tracker Data. This Kaggle data set, made available through Mobius, contains personal fitness tracker data from thirty Fitbit users, including minute-level output for physical activity, heart rate, and sleep monitoring.

Determining the credibility of the data

The FitBit Fitness Tracker Data was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. This is a third-party public data set with a small sample size and no demographic information (including gender), and it is out of date, all of which could lead to bias. It still holds a lot of information about the 30 Fitbit users that can be useful for our analysis.

Processing

I will carry out my analysis in RStudio. I am using R Markdown to document the steps of this analysis and create this notebook.

Installing and Loading required packages

install.packages("tidyverse")
install.packages("plotly")

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Importing Datasets

daily_activity <- read_csv("/Users/tohidshokati/Desktop/Google data analysis/Case study/Bellabeat/Fitabase/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_sleep <- read_csv("/Users/tohidshokati/Desktop/Google data analysis/Case study/Bellabeat/Fitabase/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_steps <- read_csv("/Users/tohidshokati/Desktop/Google data analysis/Case study/Bellabeat/Fitabase/hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heart_rate <- read_csv("/Users/tohidshokati/Desktop/Google data analysis/Case study/Bellabeat/Fitabase/heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
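
The import messages above mention `show_col_types = FALSE` as a way to quiet the column summary. As a minimal sketch, assuming the CSV files sit together in one folder, a small wrapper can also keep the long absolute path out of every call. The helper name `read_fitabase` and the temporary demo file are illustrative assumptions, not part of the original notebook; the two demo rows are copied from the hourly steps preview.

```r
library(readr)

# Hypothetical helper: read one Fitabase CSV from a given folder
# without printing the column specification message.
read_fitabase <- function(data_dir, file_name) {
  read_csv(file.path(data_dir, file_name), show_col_types = FALSE)
}

# Demonstration on a temporary two-row file so the sketch runs anywhere.
demo_dir <- tempdir()
writeLines(
  c("Id,ActivityHour,StepTotal",
    "1503960366,4/12/2016 12:00:00 AM,373",
    "1503960366,4/12/2016 1:00:00 AM,160"),
  file.path(demo_dir, "hourlySteps_demo.csv")
)

demo_steps <- read_fitabase(demo_dir, "hourlySteps_demo.csv")
demo_steps
```

With the real data, `data_dir` would point at the Fitabase folder and each import would reduce to a call such as `read_fitabase(data_dir, "dailyActivity_merged.csv")`.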

Previewing the datasets

head(daily_activity)
## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹​ActivityDate, ²​TotalSteps, ³​TotalDistance,
## #   ⁴​TrackerDistance, ⁵​LoggedActivitiesDistance, ⁶​VeryActiveDistance,
## #   ⁷​ModeratelyActiveDistance, ⁸​LightActiveDistance, ⁹​SedentaryActiveDistance
head(daily_sleep)
## # A tibble: 6 × 5
##           Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalT…¹
##        <dbl> <chr>                             <dbl>              <dbl>    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327      346
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384      407
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412      442
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340      367
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700      712
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304      320
## # … with abbreviated variable name ¹​TotalTimeInBed
head(heart_rate)
## # A tibble: 6 × 3
##           Id Time                 Value
##        <dbl> <chr>                <dbl>
## 1 2022484408 4/12/2016 7:21:00 AM    97
## 2 2022484408 4/12/2016 7:21:05 AM   102
## 3 2022484408 4/12/2016 7:21:10 AM   105
## 4 2022484408 4/12/2016 7:21:20 AM   103
## 5 2022484408 4/12/2016 7:21:25 AM   101
## 6 2022484408 4/12/2016 7:22:05 AM    95
head(hourly_steps)
## # A tibble: 6 × 3
##           Id ActivityHour          StepTotal
##        <dbl> <chr>                     <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366 4/12/2016 1:00:00 AM        160
## 3 1503960366 4/12/2016 2:00:00 AM        151
## 4 1503960366 4/12/2016 3:00:00 AM          0
## 5 1503960366 4/12/2016 4:00:00 AM          0
## 6 1503960366 4/12/2016 5:00:00 AM          0

Let’s see how many unique participants there are in each data frame. It looks like there may be more participants in the daily activity data set than in the sleep data set.

n_distinct(daily_activity$Id)
## [1] 33
n_distinct(daily_sleep$Id)
## [1] 24
n_distinct(heart_rate$Id)
## [1] 14
n_distinct(hourly_steps$Id)
## [1] 33

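There are 33 participants in the daily activity and hourly steps data frames, 24 in the daily sleep data frame, and only 14 in the heart rate data frame. The activity data therefore covers slightly more than the thirty users mentioned in the data set description.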

Cleaning data frames

First of all, I would like to check for duplicated observations.

sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
sum(duplicated(heart_rate))
## [1] 0
sum(duplicated(hourly_steps))
## [1] 0

There are 3 duplicated rows in the daily sleep data set. I am going to remove them.

daily_sleep <- distinct(daily_sleep)
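
The check-then-remove step can be tried on a tiny example. This is only a sketch: the first two rows are copied from the `daily_sleep` preview above, and the third row is an artificial repeat of the first, added here purely to illustrate the check.

```r
library(dplyr)

# Two real rows from the preview plus one artificial duplicate (row 3 repeats row 1).
sleep_demo <- tibble(
  Id = c(1503960366, 1503960366, 1503960366),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
               "4/12/2016 12:00:00 AM"),
  TotalSleepRecords = c(1, 2, 1),
  TotalMinutesAsleep = c(327, 384, 327),
  TotalTimeInBed = c(346, 407, 346)
)

sum(duplicated(sleep_demo))           # number of fully repeated rows
sleep_demo[duplicated(sleep_demo), ]  # inspect the repeats before dropping them
sleep_clean <- distinct(sleep_demo)   # keep the first copy of each row
nrow(sleep_demo) - nrow(sleep_clean)  # number of rows removed
```

On the real data the same arithmetic applies: 413 imported rows minus 3 duplicates leaves the 410 rows reported by `str(daily_sleep)` later in this notebook.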

The dates in all four data sets were read as strings (chr) and need to be converted to a date format before starting the analysis. I will also rename these columns to date for consistency.

daily_activity <- daily_activity %>% 
  mutate(ActivityDate = mdy(ActivityDate)) %>% 
  rename(date = ActivityDate)

Timestamps in the daily sleep, hourly steps, and heart rate data frames were also read as strings. I will convert them to date-time format and then split them into separate date and time columns.

daily_sleep$SleepDay=as.POSIXct(daily_sleep$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone()) 
daily_sleep <- separate(daily_sleep, SleepDay, into=c('date', 'time'), sep=' ', remove=TRUE) %>% 
  mutate(date=as_date(date), time=hms::as_hms(time))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 410 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
hourly_steps$ActivityHour=as.POSIXct(hourly_steps$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone()) 
hourly_steps <- separate(hourly_steps, ActivityHour, into=c('date', 'time'), sep=' ', remove=TRUE) %>% 
  mutate(date=as_date(date), time=hms::as_hms(time))
 
heart_rate$Time=as.POSIXct(heart_rate$Time, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
heart_rate <- separate(heart_rate, Time, into=c('date', 'time'), sep =' ', remove=TRUE) %>% 
  mutate(date=as_date(date), time=hms::as_hms(time))
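
The "Expected 2 pieces" warning for `daily_sleep` appears to come from the fact that every SleepDay value is midnight: when a date-time column containing only midnight values is converted back to text, R drops the time part, so `separate()` finds nothing after the date and fills `time` with `NA`. A minimal sketch of an alternative, assuming lubridate's `mdy_hms()` parser is acceptable here, derives the date and time directly from the parsed timestamp without going through text again. The two rows are copied from the sleep preview; the intermediate column name `timestamp` is an assumption for illustration.

```r
library(dplyr)
library(lubridate)

# Two rows from the daily_sleep preview, with SleepDay still stored as text.
sleep_example <- tibble(
  Id = c(1503960366, 1503960366),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")
)

sleep_example <- sleep_example %>%
  mutate(
    timestamp = mdy_hms(SleepDay),    # parse month/day/year with AM/PM clock time
    date = as_date(timestamp),        # calendar date
    time = hms::as_hms(timestamp)     # time of day, 00:00:00 for midnight
  ) %>%
  select(-SleepDay, -timestamp)

sleep_example
```

Since the sleep records carry no real time-of-day information, keeping only the `date` column would also be a reasonable choice.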

I would like to confirm the format corrections by running the str() function.

str(daily_activity)
## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ date                    : Date[1:940], format: "2016-04-12" "2016-04-13" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
str(daily_sleep)
## tibble [410 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Id                : num [1:410] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ date              : Date[1:410], format: "2016-04-12" "2016-04-13" ...
##  $ time              : 'hms' num [1:410] NA NA NA NA ...
##   ..- attr(*, "units")= chr "secs"
##  $ TotalSleepRecords : num [1:410] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:410] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:410] 346 407 442 367 712 320 377 364 384 449 ...
str(heart_rate)
## tibble [2,483,658 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Id   : num [1:2483658] 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ date : Date[1:2483658], format: "2016-04-12" "2016-04-12" ...
##  $ time : 'hms' num [1:2483658] 07:21:00 07:21:05 07:21:10 07:21:20 ...
##   ..- attr(*, "units")= chr "secs"
##  $ Value: num [1:2483658] 97 102 105 103 101 95 91 93 94 93 ...
str(hourly_steps)
## tibble [22,099 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Id       : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ date     : Date[1:22099], format: "2016-04-12" "2016-04-12" ...
##  $ time     : 'hms' num [1:22099] 00:00:00 01:00:00 02:00:00 03:00:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ StepTotal: num [1:22099] 373 160 151 0 0 ...

Now that the data is cleaned, I’m ready to analyze the data sets.

Analysing the data

Let’s start analyzing our data with a sneak peek into summary statistics.

daily_activity %>% 
  select(TotalSteps, TotalDistance, SedentaryMinutes, Calories) %>% 
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes    Calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :1057.5   Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :4900
daily_sleep %>% 
  select(TotalMinutesAsleep, TotalSleepRecords, TotalTimeInBed) %>% 
  summary()
##  TotalMinutesAsleep TotalSleepRecords TotalTimeInBed 
##  Min.   : 58.0      Min.   :1.00      Min.   : 61.0  
##  1st Qu.:361.0      1st Qu.:1.00      1st Qu.:403.8  
##  Median :432.5      Median :1.00      Median :463.0  
##  Mean   :419.2      Mean   :1.12      Mean   :458.5  
##  3rd Qu.:490.0      3rd Qu.:1.00      3rd Qu.:526.0  
##  Max.   :796.0      Max.   :3.00      Max.   :961.0
heart_rate %>% 
  select(Value) %>% 
  summary()
##      Value       
##  Min.   : 36.00  
##  1st Qu.: 63.00  
##  Median : 73.00  
##  Mean   : 77.33  
##  3rd Qu.: 88.00  
##  Max.   :203.00
hourly_steps %>% 
  select(StepTotal) %>% 
  summary()
##    StepTotal      
##  Min.   :    0.0  
##  1st Qu.:    0.0  
##  Median :   40.0  
##  Mean   :  320.2  
##  3rd Qu.:  357.0  
##  Max.   :10554.0

Plotting a few explorations

ggplot(data=daily_activity, mapping = aes(x = TotalSteps, y = Calories)) +
  geom_point(color = "blue") +
  geom_smooth(color = "black") +
  labs(title = "FitBit Tracker Data", subtitle = "Total Steps vs. Calories",
       x = "Total Steps", y = "Calories")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The number of steps is clearly positively correlated with the number of calories burned. Let’s take a look at the relationship between time spent in bed and total sleep time per day.
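
A scatter plot shows the direction of a relationship, and a correlation coefficient puts a number on it. The sketch below is only an illustration on the first six rows of `daily_activity` (steps from the preview, calories from the `str()` output in this notebook), not a result for the full data set.

```r
# First six rows of daily_activity as printed earlier in this notebook.
activity_head <- data.frame(
  TotalSteps = c(13162, 10735, 10460, 9762, 12669, 9705),
  Calories   = c(1985, 1797, 1776, 1745, 1863, 1728)
)

# Pearson measures linear association; Spearman only assumes a monotonic one.
pearson_r  <- cor(activity_head$TotalSteps, activity_head$Calories,
                  method = "pearson")
spearman_r <- cor(activity_head$TotalSteps, activity_head$Calories,
                  method = "spearman")

pearson_r
spearman_r
```

On the full data the equivalent call would be `cor(daily_activity$TotalSteps, daily_activity$Calories)`.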

ggplot(data=daily_sleep, mapping = aes(x = TotalMinutesAsleep, y = TotalTimeInBed)) +
  geom_point(color = "blue") +
  geom_smooth(color = "black") +
  labs(title = "FitBit Tracker Data", subtitle = "Total Sleep vs. Total Time In Bed",
       x = "Total Asleep (Minutes)", y = "Total Time In Bed (Minutes)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

As expected, the relationship is almost perfectly linear.

hourly_steps %>%
  group_by(time) %>%
  summarize(average_steps = mean(StepTotal)) %>%
  ggplot() +
  geom_col(mapping = aes(x=time, y = average_steps, fill = average_steps)) + 
  labs(title = "FitBit Tracker Data", subtitle = "Hourly Steps Per Day", x="Time", y="Average Steps") + 
  scale_fill_gradient(low = "black", high = "navy", name = "Average Steps") +
  theme(axis.text.x = element_text(angle = 45))

The chart shows that our participants are most active between 5 PM and 7 PM, probably because they go to the gym or take a walk after work. Interestingly, 11 AM to 2 PM are very active hours as well. Now I would like to explore the relationship between exercise and sleep. To check whether there is any correlation between them, I need to join the daily_activity and daily_sleep data sets.

daily_activity_sleep <- merge(daily_activity, daily_sleep, by=c('Id', 'date'))

Let’s take a look at our merged data set.

str(daily_activity_sleep)
## 'data.frame':    410 obs. of  19 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ date                    : Date, format: "2016-04-12" "2016-04-13" ...
##  $ TotalSteps              : num  13162 10735 9762 12669 9705 ...
##  $ TotalDistance           : num  8.5 6.97 6.28 8.16 6.48 ...
##  $ TrackerDistance         : num  8.5 6.97 6.28 8.16 6.48 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.14 2.71 3.19 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 1.26 0.41 0.78 ...
##  $ LightActiveDistance     : num  6.06 4.71 2.83 5.04 2.51 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num  25 21 29 36 38 50 28 19 41 39 ...
##  $ FairlyActiveMinutes     : num  13 19 34 10 20 31 12 8 21 5 ...
##  $ LightlyActiveMinutes    : num  328 217 209 221 164 264 205 211 262 238 ...
##  $ SedentaryMinutes        : num  728 776 726 773 539 775 818 838 732 709 ...
##  $ Calories                : num  1985 1797 1745 1863 1728 ...
##  $ time                    : 'hms' num  NA NA NA NA ...
##   ..- attr(*, "units")= chr "secs"
##  $ TotalSleepRecords       : num  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep      : num  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed          : num  346 407 442 367 712 320 377 364 384 449 ...

As expected from an inner join of the two data sets, the result has 410 observations, one for each row of daily_sleep.
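
To see why an inner join keeps only the matching days, here is a small sketch using the first six activity days and the first six sleep days previewed earlier for the same participant. It uses dplyr's `inner_join()` as an alternative to base `merge()`, plus `anti_join()` to list the activity days that have no sleep record; the object names are assumptions for illustration.

```r
library(dplyr)

# First six activity days from the daily_activity preview.
activity_demo <- tibble(
  Id = 1503960366,
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14",
                   "2016-04-15", "2016-04-16", "2016-04-17")),
  TotalSteps = c(13162, 10735, 10460, 9762, 12669, 9705)
)

# First six sleep days from the daily_sleep preview (4/14 and 4/18 are absent).
sleep_demo <- tibble(
  Id = 1503960366,
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-15",
                   "2016-04-16", "2016-04-17", "2016-04-19")),
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304)
)

# Rows present in both tables for the same participant and day.
joined <- inner_join(activity_demo, sleep_demo, by = c("Id", "date"))

# Activity days that were dropped because no sleep record exists.
unmatched_activity <- anti_join(activity_demo, sleep_demo, by = c("Id", "date"))

nrow(joined)
unmatched_activity
```

Five of the six activity days survive the join here; April 14 is dropped because that day has no sleep record.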

ggplot(data=daily_activity_sleep, mapping = aes(x = TotalSteps, y = TotalMinutesAsleep)) +
  geom_point(color = "blue") +
  geom_smooth(color = "black") +
  labs(title = "FitBit Tracker Data", subtitle = "Total Steps vs. Total Sleep Time",
       x = "Total Steps Per Day", y = "Total Sleep Time (Minutes)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

It looks like there is no clear correlation between daily steps and sleep duration. Let’s check the relationship between sedentary time and sleep.

ggplot(data=daily_activity_sleep, mapping = aes(y = SedentaryMinutes, x = TotalMinutesAsleep)) +
  geom_point(color = "blue") +
  geom_smooth(color = "black") +
  labs(title = "FitBit Tracker Data", subtitle = "Sedentary Minutes vs. Total Sleep Time",
       y = "Sedentary Minutes Per Day", x = "Total Sleep Time (Minutes)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The plot suggests a negative correlation between sedentary time and sleep: days with more sedentary minutes tend to have fewer minutes of sleep. A fitness smart device like a Fitbit or the Bellabeat Leaf can encourage users to exercise for a healthier lifestyle.
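
A visual impression like this can be backed by a formal test. The sketch below runs `cor.test()` on the first ten merged rows shown in the `str(daily_activity_sleep)` output above; ten rows are far too few to draw conclusions, so this only illustrates the call and how to read its result.

```r
# First ten rows of the merged data, copied from the str() output above.
merged_head <- data.frame(
  SedentaryMinutes   = c(728, 776, 726, 773, 539, 775, 818, 838, 732, 709),
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304, 360, 325, 361, 430)
)

# Pearson correlation test: estimate is the coefficient, p.value its significance.
result <- cor.test(merged_head$SedentaryMinutes, merged_head$TotalMinutesAsleep)

result$estimate
result$p.value
```

On the full data the call would be `cor.test(daily_activity_sleep$SedentaryMinutes, daily_activity_sleep$TotalMinutesAsleep)`.
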
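I noticed that some participants didn’t wear their smart device on some days, so I’m curious how often owners actually use their device. Because the calculation below uses the merged data set, a worn day here is a day with both an activity record and a sleep record, and only the 24 participants with sleep data are included.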

# Count recorded days per participant and assign a usage category
usage <- daily_activity_sleep %>%
  group_by(Id) %>%
  summarize(worn_days = n()) %>%
  mutate(fitbit_usage = case_when(
    worn_days >= 1 & worn_days <= 6 ~ "Very low usage",
    worn_days >= 7 & worn_days <= 12 ~ "Low usage",
    worn_days >= 13 & worn_days <= 18 ~ "Moderate usage",
    worn_days >= 19 & worn_days <= 24 ~ "High usage",
    worn_days >= 25 & worn_days <= 31 ~ "Very high usage"))

# Share of participants in each usage category
usage_percentage <- usage %>%
  count(fitbit_usage, name = "total_usage_type") %>%
  mutate(percentage = total_usage_type * 100 / sum(total_usage_type)) %>%
  select(fitbit_usage, percentage)

Let’s take a look at our percentage tibble.

print(usage_percentage)
## # A tibble: 5 × 2
##   fitbit_usage    percentage
##   <chr>                <dbl>
## 1 High usage            8.33
## 2 Low usage             4.17
## 3 Moderate usage       12.5 
## 4 Very high usage      41.7 
## 5 Very low usage       33.3
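
As a quick sanity check, the percentages can be converted back into participant counts. This sketch assumes the 24 participants found in `daily_sleep` earlier are the ones in the merged data; the percentages are copied from the tibble above.

```r
# Percentages as printed in usage_percentage.
percentages <- c("High usage" = 8.33, "Low usage" = 4.17,
                 "Moderate usage" = 12.5, "Very high usage" = 41.7,
                 "Very low usage" = 33.3)

n_participants <- 24  # n_distinct(daily_sleep$Id) reported earlier

# Convert each share back to a whole number of participants.
counts <- round(percentages * n_participants / 100)

counts       # 2, 1, 3, 10 and 8 participants respectively
sum(counts)  # 24
```

So the "Very high usage" group is 10 of 24 participants and the "Very low usage" group is 8 of 24.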

Let’s make a pie chart to visualize this data. I’m going to load the plotly package to create the chart.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Take labels and values from usage_percentage instead of retyping them
labels <- usage_percentage$fitbit_usage
values <- usage_percentage$percentage

fig <- plot_ly(type='pie', labels=labels, values=values, 
               textinfo='label+percent',
               insidetextorientation='radial') %>%
  layout(title = 'Smart device usage per month by owners')
fig

Summarizing conclusions and recommendations

Thank you for your interest in my project. This is my first case study in data analytics, and I’m eager to hear any comments or recommendations about it.